This is the final project for METCS544 - Bank Churners.

Goal of the project: Analyze the customer attrition data and gather insights about what could be the contributing factors.

Link to dataset : Bank-Churners-Dataset

The dataset contains 8500 records of existing customers and 1627 records of attrited customers.

We will be analyzing the following :-

  1. Attrition by Card-Category
  2. Months-on-book analysis for attrited customers
  3. Income-category/Card-category/Gender analysis for Attrited customers
  4. Distribution of customer’s Age
  5. Review of Central Limit Theorm - Customer’s Age
  6. Sampling methods - Simple Random, Systematic & Stratified

Loading the csv file with raw data

library(modeest)
options(digits=3)
library(sampling)
library(plotly)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'httr':
##   method         from  
##   print.response rmutil
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
setwd("C:/ARCHANA/Boston University MS Applied Data Analytics/METCS544 - Foundations of Analytics with R - Spring1/Final Project/")

raw.data <- read.csv("BankChurners.csv")

Data Pre-processing

Retaining only the columns that are being used for analysis

raw.data <- raw.data[,names(raw.data) %in% c("Attrition_Flag","Customer_Age",
                        "Gender","Income_Category",
                        "Card_Category","Months_on_book")]

Looking for missing values in the data

unique(raw.data$Customer_Age)
##  [1] 45 49 51 40 44 32 37 48 42 65 56 35 57 41 61 47 62 54 59 63 53 58 55 66 50
## [26] 38 46 52 39 43 64 68 67 60 73 70 36 34 33 26 31 29 30 28 27
unique(raw.data$Gender)
## [1] "M" "F"
unique(raw.data$Income_Category) # has 1112 unknown values
## [1] "$60K - $80K"    "Less than $40K" "$80K - $120K"   "$40K - $60K"   
## [5] "$120K +"        "Unknown"
nrow(raw.data[raw.data$Income_Category=="Unknown",])
## [1] 1112
unique(raw.data$Card_Category)
## [1] "Blue"     "Gold"     "Silver"   "Platinum"
unique(raw.data$Months_on_book)
##  [1] 39 44 36 34 21 46 27 31 54 30 48 37 56 42 49 33 28 38 41 43 45 52 40 50 35
## [26] 47 32 20 29 25 53 24 55 23 22 26 13 51 19 15 17 18 16 14

Income-Category field has unknown values, removing these rows from the raw data

raw.data <- subset(raw.data,Income_Category!="Unknown")

Split the raw data into 2 parts - Attrited and Existing Customers

existing.cust <- raw.data[raw.data$Attrition_Flag=="Existing Customer",names(raw.data) %in%
                        c("Attrition_Flag","Customer_Age","Gender","Income_Category",
                          "Card_Category","Months_on_book")]

attrited.cust <- raw.data[raw.data$Attrition_Flag=="Attrited Customer",names(raw.data) %in%
                      c("Attrition_Flag","Customer_Age","Gender","Income_Category",
                        "Card_Category","Months_on_book")]

PART-1 - Analysis for categorical variable - card_category

Card Category refers to the different type of credit cards that are offered to customers - Blue, Silver, Gold & Platinum.

Here is the table representation to view the frequencies of card-category

All Customers -

all.card <- table(raw.data$Card_Category)
all.card
## 
##     Blue     Gold Platinum   Silver 
##     8391      107       15      502

Attrited customers -

attr.card <- table(attrited.cust$Card_Category)
attr.card
## 
##     Blue     Gold Platinum   Silver 
##     1343       19        3       75
sprintf("Percentage of customers who attrited - by Card Category - %s - %.2f%%",
        rownames(attr.card),round((attr.card/all.card)*100,2))
## [1] "Percentage of customers who attrited - by Card Category - Blue - 16.01%"    
## [2] "Percentage of customers who attrited - by Card Category - Gold - 17.76%"    
## [3] "Percentage of customers who attrited - by Card Category - Platinum - 20.00%"
## [4] "Percentage of customers who attrited - by Card Category - Silver - 14.94%"
card.types <- c("Blue","Gold","Platinum","Silver")

plot_ly() %>%
add_trace(x = ~card.types, y = ~all.card, type = 'bar',
            text = all.card, textposition = 'auto',name="All Customers",
            marker = list(color = 'blue',
            line = list(color = 'blue', width = 1.5))) %>%
add_trace(x = ~card.types, y = ~attr.card, type = 'bar',
            text = attr.card, textposition = 'auto',name="Attrited Customers",
            marker = list(color = 'red',
            line = list(color = 'red', width = 1.5))) %>%
layout(title = "Distribution of Card Categories",
                      barmode = 'group',
                      xaxis = list(title = "Card Category"),
                      yaxis = list(title = "No. of Customers"))

INFERENCES -

Analysis of distribution of card categories using barplot


PART-2 - Analysis for numeric variable - months-on-book for attrited customers

Months-on-book refers to the number of months the customer has held the credit-card.

Here is a boxplot of months-on-book to analyze the spread

f <- fivenum(attrited.cust$Months_on_book)
out <- c(f[2]-1.5*(f[4]-f[2]),f[4]+1.5*(f[4]-f[2]))

plot_ly(attrited.cust, y = ~Months_on_book, type="box", 
        name = 'Months-on-book',quartilemethod="exclusive") %>%
    layout(title="Months-on-book - Attrited Customers")
sprintf("The average months-on-book for attrited customers is %g",
        round(mean(attrited.cust$Months_on_book)))
## [1] "The average months-on-book for attrited customers is 36"
sprintf("Most of the attrited customers were with the company for %s months",
        mfv(attrited.cust$Months_on_book))
## [1] "Most of the attrited customers were with the company for 36 months"

INFERENCES -

Histogram to analyze frequencies of the months that customers stayed with the company -

hist(attrited.cust$Months_on_book,main="Months-on-book - Attrited 
     Customers",col="light blue",xlab="months-on-book")

INFERENCE -


PART-3 - Analysis for set of 2 or more variables - Customer income-category/card-category

Income Category refers to the income range of the customer -

Less than $40K, $40K-$60K, $60K-$80K, $80K-$120K, $120K+

Contingency table of income-category & card-category - Attrited customers

sorted.attr.data <- attrited.cust[order(attrited.cust$Income_Category),]

sorted.attr.data$Income_Category <- factor(sorted.attr.data$Income_Category,
                                          levels=c("Less than $40K","$40K - $60K",
                                          "$60K - $80K","$80K - $120K",
                                          "$120K +"))

income.card <- table(sorted.attr.data$Income_Category,
                     sorted.attr.data$Card_Category)

income.card
##                 
##                  Blue Gold Platinum Silver
##   Less than $40K  586    4        2     20
##   $40K - $60K     257    2        1     11
##   $60K - $80K     172    6        0     11
##   $80K - $120K    215    5        0     22
##   $120K +         113    2        0     11

INFERENCES -

Marginal & conditional distribution of income-category & card-category

addmargins(income.card)
##                 
##                  Blue Gold Platinum Silver  Sum
##   Less than $40K  586    4        2     20  612
##   $40K - $60K     257    2        1     11  271
##   $60K - $80K     172    6        0     11  189
##   $80K - $120K    215    5        0     22  242
##   $120K +         113    2        0     11  126
##   Sum            1343   19        3     75 1440
# income-category
round(prop.table(income.card,1),3)
##                 
##                   Blue  Gold Platinum Silver
##   Less than $40K 0.958 0.007    0.003  0.033
##   $40K - $60K    0.948 0.007    0.004  0.041
##   $60K - $80K    0.910 0.032    0.000  0.058
##   $80K - $120K   0.888 0.021    0.000  0.091
##   $120K +        0.897 0.016    0.000  0.087
# card-category
round(prop.table(income.card,2),3)
##                 
##                   Blue  Gold Platinum Silver
##   Less than $40K 0.436 0.211    0.667  0.267
##   $40K - $60K    0.191 0.105    0.333  0.147
##   $60K - $80K    0.128 0.316    0.000  0.147
##   $80K - $120K   0.160 0.263    0.000  0.293
##   $120K +        0.084 0.105    0.000  0.147

INFERENCE -

Mosaic plot - representation of contingency table

mosaicplot(income.card,color=c("pink","purple"),cex.axis=0.6,las=1,
           xlab="Income Category",ylab="Card Category",
           main="IncomeCategory vs CardCategory")

Analysis - Customer Income-Category/Card-Category/Gender

Breakdown for attrited customers -

aincome.card.gender <- table(sorted.attr.data$Income_Category,
                            sorted.attr.data$Card_Category,
                            sorted.attr.data$Gender)

ftable(aincome.card.gender)
##                            F   M
##                                 
## Less than $40K Blue      559  27
##                Gold        4   0
##                Platinum    2   0
##                Silver     17   3
## $40K - $60K    Blue      160  97
##                Gold        2   0
##                Platinum    0   1
##                Silver      4   7
## $60K - $80K    Blue        0 172
##                Gold        0   6
##                Platinum    0   0
##                Silver      0  11
## $80K - $120K   Blue        0 215
##                Gold        0   5
##                Platinum    0   0
##                Silver      0  22
## $120K +        Blue        0 113
##                Gold        0   2
##                Platinum    0   0
##                Silver      0  11

Breakdown for existing customers -

sorted.exis.data <- existing.cust[order(existing.cust$Income_Category),]

sorted.exis.data$Income_Category <- factor(sorted.exis.data$Income_Category,
                                           levels=c("Less than $40K","$40K - $60K",
                                                    "$60K - $80K","$80K - $120K",
                                                    "$120K +"))

eincome.card.gender <- table(sorted.exis.data$Income_Category,
                             sorted.exis.data$Card_Category,
                             sorted.exis.data$Gender)

ftable(eincome.card.gender)
##                             F    M
##                                   
## Less than $40K Blue      2587  230
##                Gold        18    2
##                Platinum     2    0
##                Silver      95   15
## $40K - $60K    Blue       800  618
##                Gold         5    8
##                Platinum     0    0
##                Silver      43   45
## $60K - $80K    Blue         0 1101
##                Gold         0   23
##                Platinum     0    4
##                Silver       0   85
## $80K - $120K   Blue         0 1180
##                Gold         0   16
##                Platinum     0    2
##                Silver       0   95
## $120K +        Blue         0  532
##                Gold         0   16
##                Platinum     0    4
##                Silver       0   49

Bar Plots for existing vs attrited - by Gender -

male <- raw.data[raw.data$Gender=="M",]
female <- raw.data[raw.data$Gender=="F",]

mpercent <- paste0(round(table(male$Attrition_Flag)/nrow(male),2)*100,"%")
fpercent <- paste0(round(table(female$Attrition_Flag)/nrow(female),2)*100,"%")

subplot(
plot_ly(x=c("Attrited","Existing"),y=table(male$Attrition_Flag),type="bar",
        name="Male") %>%
    add_text(text=mpercent,textposition = "top",showlegend=FALSE),
plot_ly(x=c("Attrited","Existing"),y=table(female$Attrition_Flag),type="bar",
        name="Female") %>%
    add_text(text=fpercent,textposition = "top",showlegend=FALSE) %>%
layout(title="Attrition by Gender"),
shareY = TRUE)

PART-4 - Distribution of numeric data - customer’s age

Boxplot of customer’s age to analyze the spread -

subplot(
plot_ly(existing.cust, y = ~Customer_Age, type="box", 
            name = 'Existing',quartilemethod="exclusive"),
plot_ly(attrited.cust, y = ~Customer_Age, type="box", 
        name = 'Attrited',quartilemethod="exclusive") %>%
    layout(title="Age Distribution - Existing vs Attrited Customers"),
shareY = TRUE)
sprintf("The average age for attrited customers is %g",
        round(mean(attrited.cust$Customer_Age)))
## [1] "The average age for attrited customers is 47"
sprintf("The average age for existing customers is %g",
        round(mean(existing.cust$Customer_Age)))
## [1] "The average age for existing customers is 46"

INFERENCE -

Histogram to visualize the distribution -

par(mfrow=c(1,2))

hist(attrited.cust$Customer_Age,main="Attrited Customers",xlab="Customer's Age",
     xlim=c(20,70),col="red")

hist(existing.cust$Customer_Age,main="Existing Customers",xlab="Customer's Age",
     xlim=c(20,80),col="blue")

INFERENCE -

Probability Density Function - Attrited Customers -

mean.a <- mean(attrited.cust$Customer_Age)

sd.a <- sd(attrited.cust$Customer_Age)

data.a <- dnorm(attrited.cust$Customer_Age,mean=mean.a,sd=sd.a)

x.a <- seq(min(attrited.cust$Customer_Age),max(attrited.cust$Customer_Age),5)

plot(attrited.cust$Customer_Age,data.a,pch=19,main="PDF - Attrited Customers",
     xlab="Customer's Age",ylab="Probability Density Function")

Cumulative Density Function - Attrited Customers -

data.b <- pnorm(attrited.cust$Customer_Age,mean=mean.a,sd=sd.a)

plot(attrited.cust$Customer_Age,data.b,pch=19,main="CDF - Attrited Customers",
     xlab="Customer's Age",ylab="Cumulative Density Function")


PART-5 - Central Limit Theorm - Customer’s Age

We have taken 5000 samples of sizes 10,20,30 & 40 and computed sample-means for the same.

set.seed(2633)

data.mean <- round(mean(raw.data$Customer_Age),2)
data.sd <- round(sd(raw.data$Customer_Age),2)

# function to pick 5000 samples and compute sample-means
sample.func <- function(sample.size,sample.count) {
    sample.means <- numeric(sample.count)
    i <- 1
    for (i in 1:sample.count) {
        sample.means[i] <- mean(sample(raw.data$Customer_Age,sample.size,
                                       replace = FALSE))
    }
    return (sample.means)
}

# sample-size: 10
sample.means10 <- sample.func(10,5000)
mean.sm10 <- mean(sample.means10)
sd.sm10 <- sd(sample.means10)

# sample-size: 20
sample.means20 <- sample.func(20,5000)
mean.sm20 <- mean(sample.means20)
sd.sm20 <- sd(sample.means20)

# sample-size: 30
sample.means30 <- sample.func(30,5000)
mean.sm30 <- mean(sample.means30)
sd.sm30 <- sd(sample.means30)

# sample-size: 40
sample.means40 <- sample.func(40,5000)
mean.sm40 <- mean(sample.means40)
sd.sm40 <- sd(sample.means40)

We then plot the distribution of sample-means for sample-size: 40

# Plot
sd3 <- c(mean.sm40-3*sd.sm40,mean.sm40+3*sd.sm40)
# Plot density of sample means
hist(sample.means40,main="Sample-means - Sample-size: 40",
                               col="blue",prob=TRUE,
                            xlab="Sample means - Customer's Age")
abline(v=sd3,col="red") # 3-sd from the mean

INFERENCE -

Analyzing the mean & standard-deviation of the sample-means -

sprintf("Data Mean: %.2f; Data Standard Deviation: %.2f",data.mean,data.sd)
## [1] "Data Mean: 46.33; Data Standard Deviation: 7.93"
sprintf("Sample-size: %g, Mean: %.2f, SD: %.2f",c(10,20,30,40),
        c(mean.sm10,mean.sm20,mean.sm30,mean.sm40),
        c(sd.sm10,sd.sm20,sd.sm30,sd.sm40))
## [1] "Sample-size: 10, Mean: 46.35, SD: 2.49"
## [2] "Sample-size: 20, Mean: 46.36, SD: 1.74"
## [3] "Sample-size: 30, Mean: 46.35, SD: 1.44"
## [4] "Sample-size: 40, Mean: 46.34, SD: 1.26"
sprintf("Data SD/sqrt(sample.size): %.2f; SD-SampleMeans: %.2f",
        (data.sd/sqrt(40)),sd.sm40)
## [1] "Data SD/sqrt(sample.size): 1.25; SD-SampleMeans: 1.26"

INFERENCES -


PART-6 - Sampling Methods

For each of the sampling methods used, we analyze the following -

  1. How reflective is the sample of the actual data - split of existing vs attrited customers, split of card-categories.

  2. Which card-category has the highest attrition compared to findings from the actual data.

  3. Months-on-book analysis for attrited customers - does the sample reflect the findings from actual data.

Simple Random Sampling

Sample-size: 100

set.seed(2633)

srs <- srswor(n=100,N=nrow(raw.data))
srs.sample <- raw.data[srs!=0,]
  1. Reviewing the fraction of existing vs attrited customers picked by the sample -
paste("Fraction of Attrited vs Existing customers in the data - ")
## [1] "Fraction of Attrited vs Existing customers in the data - "
table(raw.data$Attrition_Flag)/nrow(raw.data)
## 
## Attrited Customer Existing Customer 
##              0.16              0.84
paste("Fraction of Attrited vs Existing customers picked by SimpleRandomSampling")
## [1] "Fraction of Attrited vs Existing customers picked by SimpleRandomSampling"
table(srs.sample$Attrition_Flag)/nrow(srs.sample)
## 
## Attrited Customer Existing Customer 
##              0.14              0.86

Split of card-categories in the sample selected -

table(srs.sample$Card_Category)
## 
##   Blue Silver 
##     95      5

INFERENCES -

-We see that the samples picked through SRSWOR approximately reflects the split between attrited/existing customers.

-However, the sample only contains data for Blue and Silver card categories.

  1. Attrition by card-category -
attr.srs <- srs.sample[srs.sample$Attrition_Flag=="Attrited Customer",]

table(attr.srs$Card_Category)
## 
## Blue 
##   14

All 14 samples picked are under the Blue category.Unable to validate the card-category with highest attrition.

  1. Months-on-book for Attrited Customers -
hist(attr.srs$Months_on_book,main="Simple Random Sampling",xlab="Months-on-book",labels=TRUE,col="orange")

6/14 = 42% of the attrited customers were with the company - 35-40 months - similar to the findings from the actual data.

Systematic Sampling

Sample-size: 100

set.seed(2633)

grp <- ceiling(nrow(raw.data)/100) # divide into groups
r <- sample(grp,1) # pick first sample
sys.sample <- raw.data[seq(r, by=grp, length=100),]
  1. Reviewing the fraction of existing vs attrited customers picked by the sample -
paste("Fraction of Attrited vs Existing customers picked by Systematic Sampling")
## [1] "Fraction of Attrited vs Existing customers picked by Systematic Sampling"
table(sys.sample$Attrition_Flag)/nrow(sys.sample)
## 
## Attrited Customer Existing Customer 
##              0.19              0.80

We see that the samples picked through Systematic sampling approximately reflects the split between attrited/existing customers, but SRS yielded a closer split.

Split of card-categories in the sample selected -

table(sys.sample$Card_Category)
## 
##   Blue   Gold Silver 
##     91      2      6
  1. Attrition by card-category -
attr.sys <- sys.sample[sys.sample$Attrition_Flag=="Attrited Customer",]

table(attr.sys$Card_Category)
## 
## Blue Gold 
##   18    1
  1. Months-on-book for Attrited Customers -
hist(attr.sys$Months_on_book,main="Systematic Sampling",xlab="Months-on-book",labels=TRUE,col="purple")

7/12=58% of attrited customers were with the company for 35-40 months as per samples picked from systematic sampling.

Stratified sampling - using proportional sizes based on the Card-category

set.seed(2633)

sorted.data <- raw.data[order(raw.data$Card_Category),]

size <- round(100*table(raw.data$Card_Category)/nrow(raw.data))

paste(c("Size of samples proportional based on card-category - ",size))
## [1] "Size of samples proportional based on card-category - "
## [2] "93"                                                    
## [3] "1"                                                     
## [4] "0"                                                     
## [5] "6"
# Error Encountered in Strata : Error in data.frame(..., check.names = FALSE) : 
 # arguments imply differing number of rows: 0, 1
# This is due to one of the groups(platinum) having a size=0
# Overriding the size to replace 0 with 1 for platinum
size <- c(92,1,1,6) 

strat <- strata(sorted.data,stratanames="Card_Category",size=size,
                method="srswor")

strata.sample <- getdata(sorted.data,strat)
  1. Reviewing the fraction of existing vs attrited customers picked by the sample
paste("Fraction of Attrited vs Existing customers picked by Stratified Sampling")
## [1] "Fraction of Attrited vs Existing customers picked by Stratified Sampling"
table(strata.sample$Attrition_Flag)/nrow(strata.sample)
## 
## Attrited Customer Existing Customer 
##              0.19              0.81

INFERENCE -

Split of card-categories in the sample selected -

table(strata.sample$Card_Category)
## 
##     Blue     Gold Platinum   Silver 
##       92        1        1        6
  1. Attrition by card-category -
attr.str <- strata.sample[strata.sample$Attrition_Flag=="Attrited Customer",]

table(attr.str$Card_Category)
## 
##   Blue Silver 
##     18      1

The sample does not contain data for platinum/gold attrited customers.

  1. Months-on-book for Attrited Customers -
hist(attr.str$Months_on_book,main="Stratified Sampling",xlab="Months-on-book",labels=TRUE,col="pink")

4/13=46% of attrited customers were with the company for 35-40 months as per samples picked from systematic sampling.


Conclusion


Next Steps


Thank you